ABSTRACT


An automatic, language-independent syntax error detection, recovery,
and correction system for LR(k) grammars is proposed. The requirement
is made that the reverse of the grammar involved is also LR(k). The
implications of and justification for this requirement are discussed.
Given that the grammar is both LR(k) and RL(k), forward and reverse
parsers localize errors and define left and right error context,
providing a strong base from which error analysis may proceed. Possible
deterministic and heuristic corrective actions to follow error analysis
are presented. The definition and selection of keys from the set of
terminal symbols for the grammar, which enable the reverse parser to be
engaged upon error detection, are discussed.

A model of the proposed system, implemented in an XPL compiler for
a large ALGOL-like grammar, is described, and the results of test
programs are presented and discussed.

Possible extensions to the system are presented and areas requiring
further analysis are defined.
I. INTRODUCTION


Most compilers and compiler writing systems have some kind of error
detection and recovery mechanisms built in. Most provide a degree of
error analysis and indicate to the user an error type and an approximate
position of the error in the input stream. Diagnostic messages range
from a reference number to full statements of suspected cause followed
by parse histories. The suspected error symbol may be flagged with a
pointer or referenced by name or both. Some error analysis systems are
even sophisticated enough to specify the error symbol exactly and state
the correction necessary.

If an error can be located precisely and defined without ambiguity
then it seems logical that an immediate correction should be made and
the processing allowed to continue. In general, it would seem that the
more exactly an error could be defined the more efficiently the user's
and computer's resources would be utilized.

Research indicates that, despite appreciable effort, attempts to
design comprehensive error processing systems to accompany the
increasingly popular mechanical compilers, translator writer systems
(TWS), and compiler-compilers have not been very successful. The error
processing systems that do exist range from extremely simple recovery-
only schemes to fairly complex attempts at error correction.

It is proposed that an efficient automatic syntax error processing
system for LR(k) grammars can be defined. The system will operate as
a function of a grammar only, its parameters being defined by the
grammar analyzer and the grammar parsing function.





The objectives of such a system would be:

(1) To detect as many syntax errors as possible. Recovery systems that
simply delete code to some predefined symbol do not afford the
programmer maximum exposure of his code to the analytical processes.

(2) To detect errors as early as possible to enable a more tenable
recovery/correction scheme. Perhaps among the most unsettling errors
are those diagnosed as "NO PRODUCTION APPLICABLE." This type of error
is generally associated with the precedence parsers and arises when
symbols are pushed onto the stack after having been interpreted as
contextually correct locally. The error is discovered when a subsequent
symbol requires a reduction of the symbol stack and the error symbol
does not fit any production definition.

(3) To make as many viable corrections as possible so as to allow
continuous scanning for maximum error detection, deleting code to
effect recovery only as a last resort.

(4) To avoid generating new syntax errors by either correcting the
error or effecting a complete recovery. The inefficiency in correcting
an error (or worse, recovering from one) only to alter the code so as
to create another syntax error is evident.

(5) To avoid passing errors into the parse stack. This condition gives
rise to the difficulties of having to "undo" emitted code.

(6) To define errors as exactly and completely as possible, if only to
provide more meaningful diagnostics should the error correction
attempt fail.

The error correcting system will be defined to operate in an XPL
compiler for LR(k) grammars whose reverse is also LR(k) and will be
capable of correcting detectable error sequences of n symbols, where n
would be fixed when the compiler was constructed. For grammars meeting
this restriction, forward and reverse LR(k) parsers can be defined and
will be employed to localize errors and define error context.

The left context of an error is defined by normal LR(k) parsing of
the input stream. The right context is definable by employing the
finite state machine (FSM) representation of an LR(k) parser. Key
symbols that uniquely define states in the FSM are selected from the set
of terminal symbols for the grammar. When an error is detected, the next
n symbols are ignored and the input code following the error sequence
is scanned for a key symbol. When a key is located, the reverse parser
is engaged to parse back to the error sequence. The right context
thus defined, coupled with the left context provided by the forward
parser, forms a base from which error analysis may commence. Error
corrections are defined by generating symbol strings of length n and
comparing them with the error sequence.

The effectiveness of the system will be demonstrated by implementing
the procedure for a non-trivial ALGOL-like language. The system
was restricted from accessing the LR(k) parse stack. Though broad
classes of errors are correctable, this restriction defined a small
set of errors that is not easily corrected. In the event that the
error could not be corrected deterministically, the error analyzer was
defined to always heuristically select a symbol for insertion or
replacement as an attempted correction. In this situation, the analyzer
would continue to manipulate the symbol sequence between the forward
parser and the key in an attempt to achieve correct syntax, but there
were cases where the resulting correction became unrealistic. Hence,
it was necessary to place a restriction on the number of heuristic
attempts that would be made to correct the error. If a complete
correction was not effected within this many attempts, the process was
aborted, code was deleted through the key, and forward parsing was
restarted at the symbol following the key.





II. CURRENT SYSTEMS


As early as 1963 the need for automatic error analysis and correction
systems to be part of syntax directed compilers was recognized.
Efforts toward the accomplishment of this goal resulted in systems
with capabilities ranging from simple recovery to fairly complex
recovery and correction. A sample of the spectrum may be found
in considering briefly the works of Irons, McKeeman, Leinius, LaFrance,
and Rich.


A. IRONS 

Irons [5] designed a parse algorithm which was guaranteed to manipulate
an input stream until it was syntactically correct for some defined
grammar. Briefly, the mechanism involved carrying out all possible
parses simultaneously. An error condition was defined when none of
the current parses could continue. Error recovery and correction
involved discarding the input stream from the error symbol until a
symbol was found that would be syntactically correct for one of the
existing parses. A string of symbols (including the null string) that
would permit the selected parse to continue was then generated and
inserted at the error point. Irons claimed the algorithm to be
"relatively" efficient in terms of space and time requirements.
However, it is conjectured that the algorithm would not be competitive
in terms of space and time requirements if it were used on a larger
grammar for a user-oriented language.

The algorithm accomplishes error correction but at a rather
primitive level, as it operates on the very simple mechanism of deleting
code rather than making any attempt to analyze the error relative to
its total environment. No attempt is made to ascertain the extent of
the error or its total local context. For example, if a missing
punctuation symbol following a statement constituted the error, then it
is highly probable that a correct following statement would be deleted
in the search for the punctuation mark. Automatic wholesale code
deletion such as this is a fairly severe price to pay for error
correction, particularly when program logic may be destroyed.


B. McKEEMAN

In Reference 10, McKeeman exemplifies the simple extreme. When an
error condition occurs, the input stream is scanned for an obvious
"stop" symbol for the language; the semicolon was used in the reference.
The interim code, including the error condition, is deleted and parsing
is re-initialized at the stop symbol.

The advantages of such a system are obvious--it is easily and
efficiently implemented, it is fast, and it does not create any new
syntax errors. However, as there is no attempt to correct an error,
there is no possibility of executing the program. Additionally, the
programmer loses the opportunity to have all of his code scanned for
syntactic continuity.

Example: IF ... e1 ... THEN IF ... e2 ... THEN ... ;

Error e2 will not be found in the process of deleting code between
error e1 and the semicolon.
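
The stop-symbol recovery described above can be sketched in a few lines
of Python. This is only an illustration under stated assumptions, not
McKeeman's code: the function name skip_to_stop_symbol and the flat
token-list representation are hypothetical, and deletion is assumed to
begin at the symbol where the error was detected.

```python
def skip_to_stop_symbol(tokens, error_index, stop_symbol=";"):
    """Return (deleted, resume_index) for stop-symbol recovery.

    Everything from the error position up to, but not including, the next
    stop symbol is discarded, and parsing is re-initialized at the stop
    symbol. If no stop symbol remains, the rest of the input is deleted.
    """
    for i in range(error_index, len(tokens)):
        if tokens[i] == stop_symbol:
            return tokens[error_index:i], i
    return tokens[error_index:], len(tokens)


# Hypothetical token stream containing two errors, e1 and e2.
tokens = ["IF", "e1", "THEN", "IF", "e2", "THEN", "X", ";"]
deleted, resume = skip_to_stop_symbol(tokens, tokens.index("e1"))
print(deleted)         # the deleted span contains e2, so e2 is never diagnosed
print(tokens[resume])  # parsing resumes at the stop symbol
```

The second error sits inside the deleted span, which is the loss of
diagnostic coverage noted in the example.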


C. LEINIUS

Working with the LR(k) grammars, Leinius' parser constructor
defines a set of right context symbols to be used for error recovery
for each partial parse existing in each state of the parser [9].
Locating a member of the set in the input stream allows the completion
of a partial parse and the resultant reduction to be made. When an error
symbol is read, a choice of recovery procedures is offered. The symbol
string may be immediately scanned for one of the currently applicable
right context symbols, or the stack may be searched to determine if the
symbol just read is a right context symbol for some partial parse
existing deeper in the stack. If the stack search fails, then a decision
must be made as to the state in which scanning should commence to locate
a right context symbol. This system is a more refined attempt at error
recovery, as the right context symbol offers a more local choice than
simply scanning ahead to a stop symbol. But the system closely
parallels Irons' in that it is also possible that wholesale deletions
can take place while scanning for a required symbol. More importantly,
new syntax errors can be generated.

Example: (X * e1 (X * X)) 

The second left parenthesis will be deleted while scanning to the
right looking for the required "X" with which to replace the error e1,
and that deletion will obviously create an additional error when the
parser attempts to read the second right parenthesis at the end of the
string.


D. LaFRANCE

LaFrance's error correction system employs groups of Floyd
productions redefining a BNF language, with the necessary error
productions built into the groups [8]. The error correction mechanism is
based essentially on pattern matching. For errors involving unique
productions, that is, productions that require no context check, the
symbol at the top of the stack and/or the next input symbol are
manipulated in accordance with an ordered set of transposition,
insertion, and deletion rules. Otherwise, the applicable productions are
expanded to three symbols ahead. These triples are then compared against
the next four symbols from the input stream to find a match in a set of
twenty patterns; the match defines a correcting modification to the
input stream. If no match is found, the input stream is scanned until a
symbol is located which will permit completion of a partial parse;
control is then passed to the appropriate group and processing
continues.


E. RICH

Rich [11] performed some preliminary work on an error correction
system for mixed strategy parsers based on a scheme suggested by
Gries [3]. It involves using legal triples to correct an error. A
legal triple is an ordered, syntactically correct set of three terminal
symbols for the grammar. For errors restricted to single symbols, the
triples would be applied either to the symbol prior to the error and the
error symbol, or to the former and the symbol following the error. In
this manner a required deletion, replacement, insertion, or
transposition would be defined.
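
One plausible reading of the legal-triple idea can be sketched as
follows. This is a hedged illustration, not Rich's or Gries' algorithm:
the function name triple_corrections and the tiny set of triples are
hypothetical, and only insertion and replacement are shown.

```python
def triple_corrections(legal_triples, prev, err, nxt):
    """Propose single-symbol corrections from a set of legal triples.

    prev is the symbol before the error, err the error symbol, and nxt the
    symbol following the error.
    """
    proposals = []
    for a, b, c in sorted(legal_triples):
        if a == prev and c == err:
            proposals.append(("insert", b))   # yields: prev b err
        if a == prev and c == nxt:
            proposals.append(("replace", b))  # yields: prev b nxt
    return proposals


# Hypothetical legal triples for a toy expression grammar.
legal = {("ID", ":=", "ID"), ("ID", "+", "ID"), ("(", "ID", ")")}

# Fragment "ID ID ;": an operator is missing between two identifiers.
print(triple_corrections(legal, "ID", "ID", ";"))
# Fragment "( + )": the operator should have been an identifier.
print(triple_corrections(legal, "(", "+", ")"))
```

The first call returns more than one proposal, which illustrates why
such a scheme needs some way to choose among, and possibly retract,
candidate corrections.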

Rich anticipated that error correction attempts would have to be
limited and that such a system would require provisions to facilitate
recovery from an error correction that was found to be wrong. This
would entail saving all parsing information at the point of the error,
perhaps in the form of a temporary parse stack operating locally in
parallel with the main stack. More importantly, provisions for a means
of cancelling any code emitted during an aborted error correction could
be required. Rich suggested that if a correction could not be applied
then a unit of code (e.g., <STATEMENT>) would be deleted and a pseudo
statement (e.g., a diagnostic message) substituted.





III. PROPOSED SYSTEM


The basic mechanics of the system were initially conceptualized as
involving analysis of the input string following a syntax error. This
analysis, coupled with that which had preceded the error, would provide
a more cohesive context in which to analyze the error and thus enhance
error localization and definition and increase the probability of
selecting the most applicable correction. Error analysis in this
environment would be more definitive than schemes involving matching
patterns of terminal symbol strings or extrapolating possible inputs
from the analysis available prior to the error.


A. LR(k) GRAMMARS

The LR(k) grammars were selected as the class to which the system
would apply, as they and LR(k) parsing enjoy several advantages over
simple and mixed strategy precedence (MSP) techniques: (1) the class of
LR(k) grammars includes the precedence grammars; (2) the LR(k) parse
stack provides an accessible and complete parse history to any point
during processing of the object string, and this deterministic context
should permit more confident error analysis; and (3) all syntax errors
are detected in read or lookahead states in the form of "ILLEGAL SYMBOL
PAIRS"; thus, the LR(k) parse stack is free of syntax errors.

LR(k) parsers may be represented by a characteristic finite state
machine (CFSM) [2] which has two essential kinds of active states--read
and lookahead. The lookahead states are required to resolve stacking/
reduction decisions; that is, the next k symbols in the input stream
define sufficient context to resolve the local conflict. Associated with
each state in the FSM is a unique accessing symbol. The accessing
symbol is the terminal or nonterminal symbol from the grammar that has
caused the recognizer to enter that state. In Figure 1, the nonterminal
symbol <Block Body> is the result of a reduction made to a portion of
the symbol stream already processed onto the parse stack and is the
accessing symbol for read state 5. Read state 5 causes the next symbol
in the code stream, s1, to be read. If s1 is the symbol END then a
transition is made to reduce state 8; if s1 is a semicolon then a
transition is made to read state 36. These two symbols then become
accessing symbols for their respective transition states. Similarly,
the symbols BEGIN, END, ... WRITEON are accessing symbols for their
respective states following read state 36. The terminal symbols that
are state accessing symbols will play a significant role in the proposed
system and will be discussed below.
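
The Figure 1 fragment described above can be written down as a small
transition table. This is a sketch only: the state numbers 5, 8, and 36
and their accessing symbols come from the text, while the dictionary
layout and the function name step are assumptions made for illustration.
The successors of read state 36 are omitted because their state numbers
are not given.

```python
# Fragment of a characteristic FSM. Each state records its kind, its single
# accessing symbol, and the transitions taken on the next symbol read.
CFSM = {
    5: {"kind": "read", "accessing": "<Block Body>",
        "transitions": {"END": 8, ";": 36}},
    8: {"kind": "reduce", "accessing": "END", "transitions": {}},
    36: {"kind": "read", "accessing": ";", "transitions": {}},
}


def step(state, symbol):
    """Return the next state, or None when the symbol is illegal here."""
    return CFSM[state]["transitions"].get(symbol)


print(step(5, "END"))    # transition to reduce state 8
print(step(5, ";"))      # transition to read state 36
print(step(5, "BEGIN"))  # None: an illegal symbol pair in this fragment
```

A missing entry in the transition table corresponds to the "ILLEGAL
SYMBOL PAIRS" error, which is why the error is caught before anything
erroneous is stacked.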

The entire LR(k) parse stack is accessible and defines the complete
parse history. As LR(k) parsing is deterministic, each new state is a
unique transition from its predecessor. This deterministic trace
through the FSM as a symbol string is processed continuously confirms
syntactic continuity as each state is entered. Therefore, it is
generally not necessary to access the entire stack to determine left
context for a specific symbol.


B. LR(k)/RL(k) GRAMMARS

To achieve error isolation and definition of error context, the
stipulation was made that the grammar on which the error corrector
would operate must be both LR(k) and RL(k). Then the construction of
an LR(k) parser for the reverse of the grammar would enable
bi-directional analysis of an error, in that both forward and backward
parsers should recognize a given error in a sentence from the language.

It was fully appreciated that the above requirement was not
insignificant. Knuth [7] discussed the LR(k)/RL(k) relation briefly by
giving an example of a language for which an RL(k) grammar could be
constructed but an LR(k) grammar could not, for any k. The specific
problem that he described was encountered in the reverse situation in
the grammar used in the model. Given the two ALGOL-E sentential forms:

FUNCTION <ID>(<ID>,<ID>,...,<ID>); 
and 

READ (<VAR>,<VAR>,...,<VAR>); 


where <VAR> may be derived from <ID>, an input sequence: 


READ (AAA,BBB,CCC);


is deterministic when read in a left to right manner because of the 
differentiating reserved word READ, but is not LR(k) when read right 
to left, because the parser cannot decide whether or not to reduce the 
identifier to <VAR> until the symbol READ has been recognized. This 
ambiguity was resolved in the model grammar used in this research by 
changing the read list delimiters from parentheses to vertical bars. 
(Two other similar changes to the grammar were required and will be 
described in Section IV.) 
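
The right-to-left difficulty can be made concrete by reversing the two
sentential forms and counting how many symbols they share before the
distinguishing symbol appears. The sketch below is illustrative only;
the function names are hypothetical, and every identifier is abstracted
to the single token "<ID>".

```python
def reversed_tokens(head, n_args):
    """Tokens of 'head ( <ID> , ... , <ID> ) ;' in right-to-left order."""
    args = []
    for i in range(n_args):
        if i:
            args.append(",")
        args.append("<ID>")
    return list(reversed(head + ["("] + args + [")", ";"]))


def shared_prefix_length(a, b):
    """Number of leading positions at which the two sequences agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


for n_args in (1, 3, 10):
    read_stmt = reversed_tokens(["READ"], n_args)
    func_stmt = reversed_tokens(["FUNCTION", "<ID>"], n_args)
    print(n_args, shared_prefix_length(read_stmt, func_stmt))
```

The shared prefix grows with the length of the list, so no fixed k of
lookahead lets a right-to-left parser decide whether to reduce the
first identifier it meets to <VAR>.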

The cost of sacrificing minor user-oriented features should not
necessarily preclude a language from more efficient processing
techniques. Involved here is the sacrifice of minor symbology so as
to permit automatic error processing of the grammar. Minor modifications
of this same nature to specific grammars may enable the proposed system
to apply to a significant set of interesting languages. As the model
grammar is not trivial, a valid example is provided.


C. REVERSE PARSING

Error definition and correction were approached from the point of
view that they involved essentially the analysis of an error in its
environment, and that the probability of not making a mistake during
analysis was a function of the magnitude of the error environment
considered. Thus, when an error is detected, it becomes necessary to
read the input code that follows the error sequence and relate this
right context to that on the left. In this manner the error would be
localized. Pattern matching techniques, such as LaFrance's, accomplish
this by projecting ahead all productions applicable at the
point of error detection. This extension defines a set of all possible
correct symbol patterns that may syntactically follow the last accepted
symbol. LaFrance extrapolates all legal triples and is thus able to
correct most single and double symbol errors and, in some cases, triple
symbol errors, particularly those involving reordering of the generated
triples.

Except in the last case, at least one of the symbols in the generated 
triple was used to define the right context of the error. When one or 
more of the symbols in the triple were matched by symbols in the next 
four symbols from the input stream, corrections were based on the 
interpretation that the error extended from the symbol at which parsing 
halted to the start of the matching sequence. 

It would also be possible to define right context by scanning the
input stream to the end and allowing the reverse parser to parse from
right to left. When the reverse parser stopped due to an error, the
right context would be defined to that point. This can immediately be
seen to be a very impractical method.

A means was needed to unambiguously engage reverse parsing at some
intermediate symbol in the code stream beyond the error sequence. This
would require the ability to uniquely define a parse state for that
intermediate symbol. If a state could be so defined then, by the
nature of LR(k) parsing, the parse history prior to that state could
be inferred. Starting the reverse parser at an intermediate symbol
would, in essence, simulate having parsed from the end of the code
stream to that symbol.

If the symbol immediately following the error sequence could be 
determined and if this symbol was a FSM state accessing symbol, that 
is, it defined a unique state in the FSM, then an immediate transition 
to that state could be made. Associated with each read and lookahead 
state is a defined set of terminal symbols any of which is syntactically 
correct with respect to the accessing symbol for that state. When the 
transition was made, the reverse parser would be in a position to 
immediately reference the last symbol in the error sequence. 

In this instance, only ordered pairs, rather than triples, would be
required for pattern matching, as this is all that would be required to
span the error sequence. The savings made by having to construct one
less level of a generation tree are immediately apparent.

However, an immediate extension was suggested. If the symbol
immediately following the error will define a unique reverse parse
state, then it may be possible to select any terminal symbol that so
defines a state, find this symbol in the input stream, transfer to the
appropriate state, and parse back to the error.

D. KEYS

The determination of symbols or keys that uniquely defined a 
reverse parser state was predicated on several requirements. Certain 
required attributes of a key were easily defined: the key should not be 
part of the error and it must appear in the code stream. 

To ascertain that the keys were located outside of the error sequence 
required restricting the maximum length of the error sequence to n 
symbols. Then scanning for the keys would commence n symbols after the 
point of error detection. 

The stipulation that the key must appear required that at least two
symbols be designated keys. The first would be some symbol from the set
of terminal symbols for the grammar and, to provide for the case where
this symbol is not present in the balance of the input stream, the
second symbol would be that used by the grammar to signify end-of-file.
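
The key scan described in the two preceding paragraphs can be sketched
as below. This is a minimal illustration under assumptions: the function
name find_key, the token-list representation, and the end-of-file marker
"<EOF>" are hypothetical, and the n skipped symbols are assumed to be
counted from the symbol at which the error was detected.

```python
def find_key(tokens, error_index, n, keys, eof="<EOF>"):
    """Return the index of the first key at or beyond error_index + n.

    The n symbols starting at the point of error detection are stepped over
    so that the key cannot lie inside the allowable error sequence. The
    end-of-file symbol always qualifies, so a key is found whenever the
    stream is terminated by it.
    """
    for i in range(error_index + n, len(tokens)):
        if tokens[i] in keys or tokens[i] == eof:
            return i
    return None


tokens = ["X", ":=", "Y", ";", "Z", ":=", "1", ";", "<EOF>"]
print(find_key(tokens, 1, 3, {";", "."}))  # first ";" is skipped; index 7
print(find_key(tokens, 1, 1, {";", "."}))  # index 3
```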

Also, while keys could be located well beyond the error sequence,
they should be located close enough so as to minimize the probability
of encountering a second error while parsing back to the first.

A key that would specify a state in which reverse parsing could 
commence was only sufficient for reverse parsing. To provide for the 
case that the error was not correctable, it was also necessary that 
this key specify a state to which the forward parser could be transferred 
and restarted. 

If the grammar is structured, as is the model grammar, then keys
may be suggested by the delineators of the basic recursive forms. The
basic form of an ALGOL-E sentence was quickly discerned as the terminal
symbol BEGIN followed by any number of <Declaration Set>'s, each
delineated by a period, followed by at least one <Statement>, with
semicolons separating multiple <Statement>'s, followed by the terminal
symbol END. The period, semicolon, and END were considered as possible
keys.

A grammar analyzer with which to define keys was not designed; 
however, a semi-mechanical analysis process was defined and applied to 
the model grammar. 

After excluding the symbols <Identifier>, <Number>, and <String> 
from the model grammar, the intersection of terminal accessing symbols 
defining read states in the two parsers was found. The set contained 
only ".", ";", the set of arithmetic operators, "OR", "AND", "(", and 
"t" thereby eliminating END from the tentative list. Applying 
intuitive arguments, the set was further reduced. 

All of the symbols, except the period and semicolon, were
disqualified because they need not appear regularly in a code stream and
they defined illogical potential deletion units. Long strings of code
between the error sequence and the key would increase the probability
of encountering a second error, thereby causing the error correction
attempt to be aborted and all code to the key to be deleted. An
illogical deletion unit would arise, for example, if the reserved word
AND were used as a key and the attempt at error correction failed.
Though it may be possible to delete code between two AND's and preserve
syntactic continuity in the remaining code, intuitively, that deletion
would violate the basic structure of the language.

The period and semicolon appeared to have both desired attributes.
Judging from the language, both occur fairly regularly and, more
importantly, the strings of code between either and an error are of
manageable lengths.

Additionally, the left parenthesis and the add and subtract signs
defined multiple states in which the reverse parser could be started.
It would be a simple matter to scan for (say) a left parenthesis, but
it would not be readily apparent in which state the reverse parser
should be started to process code back to the error.
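
The semi-mechanical part of this key analysis amounts to a set
intersection followed by a uniqueness filter, as the sketch below
suggests. The function names and all state numbers except 36 are
hypothetical placeholders, not values taken from the model grammar, and
the intuitive arguments about deletion units are not modeled.

```python
def candidate_keys(forward_read_access, reverse_read_access, excluded):
    """Terminals that access read states in both parsers, minus exclusions.

    Each mapping takes a terminal symbol to the set of read states that the
    symbol accesses in the corresponding parser.
    """
    common = set(forward_read_access) & set(reverse_read_access)
    return common - set(excluded)


def unique_state_keys(candidates, reverse_read_access):
    """Keep candidates that name exactly one reverse-parser start state."""
    return {s for s in candidates if len(reverse_read_access[s]) == 1}


# Hypothetical accessing-symbol tables.
forward = {";": {36}, ".": {12}, "(": {40}, "<Identifier>": {3}}
reverse = {";": {7}, ".": {9}, "(": {20, 21, 22}, "END": {15},
           "<Identifier>": {4}}

candidates = candidate_keys(forward, reverse, {"<Identifier>"})
print(sorted(candidates))
print(sorted(unique_state_keys(candidates, reverse)))
```

In this toy table END drops out at the intersection step and the left
parenthesis drops out at the uniqueness step, mirroring the two kinds of
elimination described in the text.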

Though a simplistic approach, this general analysis of the grammar
suggested several variants and extensions to the definition, employment,
and effects of keys.

For example, the left parenthesis was found to be an accessing
symbol for six read states in the reverse parser, three of which were
independently unique. (The grammar analyzer employed to construct the
parser was not designed to remove redundant states in the FSM, which
it is possible to do.) As one of the prime objectives was to remain
close to the error so as to avoid second errors as much as possible, it
was seen that it could be significantly beneficial with respect to error
correction capability to assign a symbol such as the left parenthesis
as a key. The parenthesis is an often used symbol and its being
designated a key would enable, in many cases, scanning and processing
shorter strings of code. Resolution of the ambiguity created by the
multiple states defined by the key could be accomplished by providing
for variable path parsing via a system such as Irons'. That is, start
the reverse parser in each state defined by the key and allow it to
return to the error. It may be the case that an increased selection
of possible error corrections may evolve, thereby enhancing the system's
overall ability.

Secondly, only those symbols defining read states were considered
for the model; however, it could be of benefit not to restrict key
selection to that case. Through grammar analysis it may be
possible and practical to define more valuable keys by considering
those terminal symbols that define lookahead and reduce states in
addition to read states.

In fact, a natural extension of the preceding discussion might be
to consider only the symbol immediately following the maximum error
sequence and allow variable path parsing back to the origin of the
error. However, in the event that error correction failed, the
problems associated with error recovery would remain to be resolved.
It would be highly probable that the sequence of code between the error
point and the key would not be a convenient string to delete. One
possible solution would be to delete code to the first available key
that did define a logical deletion unit.

Consideration of the above possibilities was doubly motivated:
first, by the objective to keep keys as close to the error as practical,
and second, by the surprise of finding a set of fifty terminal symbols
so severely reduced when the subset of those symbols defining read
states in both parsers was determined. It seems very likely that there
may be interesting LR(k) grammars that would be excluded from the
proposed system by restricting the definition of keys to those symbols
that mutually defined only read states between the two parsers.


E. PROCEDURE 

When an error is detected in either a read or lookahead state,
the corrector procedure requires stepping over n symbols to ensure that
the key selected is not embedded in the error string, scanning forward
until a key is encountered, and engaging the reverse parser in the
state prescribed. The reverse parser is allowed to parse backwards
until it either stops at the same point at which the forward parser
stopped or is stopped due to encountering an error. If the length of
the symbol string between the two parsers is greater than n, then the
restriction on error magnitude has been violated; code will be deleted
to the key and the forward parser will be restarted at the symbol
following the key. If the number of symbols between the two parsers
is equal to k, 1 <= k <= n, then symbol strings of length k are generated
from the context of either parser and, via a set of pattern matching
rules such as those defined by LaFrance, the generated strings are
compared with the error string and symbol deletion, insertion,
replacement, and/or transposition will be defined. If k is equal to
zero, then the reverse parser has returned to the symbol recognized as
an error by the forward parser. The error may be quickly resolved by
intersecting the symbol sets associated with the two parse states,
thereby defining a replacement symbol. Alternatively, a deletion may be
defined by determining that both parsers would be satisfied by the
symbols that follow the error relative to either parser.
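
The case analysis on the gap between the two parsers, and the
intersection used when the gap is zero, can be sketched as follows. The
function names, the action labels, and the example symbol sets are
hypothetical; the pattern matching step for gaps of one to n symbols is
only named, not implemented.

```python
def classify_gap(gap_length, n):
    """Select the corrector action from the number of symbols left between
    the forward and reverse parsers (n is the maximum error length)."""
    if gap_length > n:
        return "delete-to-key"   # error magnitude restriction violated
    if gap_length >= 1:
        return "pattern-match"   # generate strings of this length and compare
    return "intersect"           # both parsers stopped on the same symbol


def replacement_candidates(forward_expected, reverse_expected):
    """Symbols acceptable to both parse states at the error position."""
    return set(forward_expected) & set(reverse_expected)


# Hypothetical symbol sets for the two stopped parsers.
forward_expected = {"<Identifier>", "<Number>", "("}
reverse_expected = {"<Identifier>", "<Number>", ")"}

print(classify_gap(3, 2), classify_gap(2, 2), classify_gap(0, 2))
print(sorted(replacement_candidates(forward_expected, reverse_expected)))
```

An empty intersection means that no replacement is deterministically
defined, which is the situation in which a heuristic choice would be
considered.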

In the case that the reverse parser is not in an error condition
while reading the forward parser error symbol, an insertion symbol may
be defined by intersecting the parse state symbol sets after stepping
the reverse parser to its next read or lookahead state.

In the event that all deterministic error correction attempts fail,
it may be advantageous to heuristically select a symbol from the forward
parser symbol set to either replace or be inserted in front of the error
symbol and restart forward parsing rather than automatically proceed
with code deletion.

At the cost of the extra processing time required, a heuristic
attempt to correct an error would serve two purposes. It may provide
the necessary impetus to complete the correction or, even if the attempt
failed, it should define to the programmer an approach to correction
through the associated diagnostics.

Consider the case where the allowable error magnitude is one symbol 
and the error is actually the omission of two symbols. For example: 

Же БИ РТЕШНЕК. 222 
where the symbols "Z;" have been omitted. The forward parser will 
detect an error when it attempts to access the symbol IF and the 
reverse parser will detect an error accessing the plus sign. Neither 
parser may be satisfied by any deletion of adjacent errors, nor by the 
transposition of any symbol pairs. Also, the intersection of the symbol 
sets associated with each parser state will be empty, thus an insertion 
or replacement symbol will not be deterministically defined. A 
heuristic attempt to correct may be made at this point by selecting 
a symbol from the forward parser symbol set for insertion in front of 
the error symbol (the error symbol is the word IF for the forward
parser).

Obviously, by inspection, a choice is available. The selection 
would certainly include a number, another identifier, and a left 
parenthesis. Two of these three symbols would effectively reduce the 
remaining error to a single symbol and permit the deterministic processes 
to re-analyze the error. 

If the left parenthesis was selected then the gains are not so 
obvious. On the next analysis iteration it is probable the deterministic 
attempts would again fail. Heuristically, however, another symbol would 
be inserted. 

How symbols are selected from applicable sets is also variable. 
Whether they are selected as they are ordered in the set or in reverse 
order may be problematic. However, a means to avoid issuing a duplicate 
of the previous choice would probably be required. 

In the manner described and within the confines of error restrictions, 
the proposed error corrector accomplishes error detection as 
early as possible and defines error processing such that the error is 
not propagated to the stack. A strong deterministic attempt will be 
made to correct an error and, failing that, a heuristic choice of 
correction will be applied. 

Two other facilities would be required to support the proposed 
system: (1) an upper limit on the number of heuristically selected 
corrections that would be made for any one error must be specified, 
and only when this limit was reached would code be deleted; and (2) 
complete communication with the programmer must be provided to ensure 
that, in the event error correction failed, the diagnostics would 
provide a complete history of corrector action, helping to isolate, 
and perhaps allowing the user to quickly discern, the true cause of 
the error. 
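
The two supporting facilities can be pictured as a bounded retry loop that records every attempt. The sketch below is an illustration in Python under stated assumptions: `heuristic_correct`, `try_symbol`, and the wording of the history messages are hypothetical, and `try_symbol` stands in for applying the symbol and re-running the deterministic analysis.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def heuristic_correct(candidates: Sequence[str],
                      try_symbol: Callable[[str], bool],
                      limit: int) -> Tuple[Optional[str], List[str]]:
    """Try candidate symbols in order, at most `limit` times.

    Returns the accepted symbol (or None) and a history of every attempt,
    which would be reported to the programmer if correction fails.
    """
    history: List[str] = []
    previous: Optional[str] = None
    attempts = 0
    for symbol in candidates:
        if attempts >= limit:
            break
        if symbol == previous:
            continue  # never reissue the previous choice
        attempts += 1
        previous = symbol
        if try_symbol(symbol):
            history.append(f"tried {symbol}: accepted")
            return symbol, history
        history.append(f"tried {symbol}: rejected")
    history.append("no correction found: deleting code to the key")
    return None, history
```

For example, with the candidates `["(", "(", "<Number>"]` and a limit of four, the duplicate left parenthesis is skipped and `<Number>` is the second attempt.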

The case that the key symbol had been misplaced and in itself 
constituted an error required consideration. No problem would arise 
if a key was located in the allowable error string as this string would 
not be considered when scanning for keys. If the key was erroneously 
placed beyond the error string then the error restrictions would be 
violated; however, the violation would not be detected until the code 
sequence between the key and the following key was processed. The 
corrector would not recognize an erroneous key in itself; hence, 
correction procedures would be applied to both strings of code, that 
preceding the erroneous key and that immediately following. 

The possibility of defining symbol strings vice single symbols as 
keys to alleviate the problem of keys being in error was considered. 
Again, these considerations were also motivated by the desire to place 
the keys as close to the error as possible to preclude encountering 
second errors. 

It may be possible to define ordered sets of terminal symbols such 
that their being located in the input stream would specify a unique 
start state for the reverse parser whose accessing symbol would be one 
of the elements of the set. For example, if the string <Operator> ( 
<Identifier> uniquely defined a reverse parser state such that the 
accessing symbol was a left parenthesis, then the location of this 
string following an error may preclude the requirement to scan further 
for a semicolon or period. Thus, the possibility of encountering a 
second error while reverse parsing would be reduced. 
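
Keying on symbol strings amounts to scanning the symbols after the error for the first string that appears in a table of keys. The following Python sketch assumes such a table exists; the table contents and the state names `R17` and `R1` are hypothetical and are not taken from the ALGOL-E parser tables.

```python
from typing import Dict, List, Optional, Tuple

# Maps a key (a string of one or more terminal symbols) to the reverse
# parser start state it is assumed to define uniquely.
KeyTable = Dict[Tuple[str, ...], str]


def find_key(symbols: List[str], start: int,
             keys: KeyTable) -> Optional[Tuple[int, str]]:
    """Return (index where the key begins, reverse start state) for the
    first key found at or after `start`, or None if no key is present."""
    for i in range(start, len(symbols)):
        for key, state in keys.items():
            if tuple(symbols[i:i + len(key)]) == key:
                return i, state
    return None


keys: KeyTable = {
    ("<Operator>", "(", "<Identifier>"): "R17",  # string key
    (";",): "R1",                                 # primary key
}
symbols = ["X", ":=", "Y", "<Operator>", "(", "<Identifier>", ")", ";"]
```

Here a scan starting at index 2 stops at the string key (index 3) without reading on to the semicolon, which is the reduction in scanning distance the text describes.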

The above concept of keying on symbol strings may be extendable to 
enable the forward parser to perceive or extrapolate symbol sets, based 
on the state it was in when the error was recognized and the left 
context, that, if located in the code stream following the error, would 
define unique start states for the reverse parser. It may be possible 
to define a set or hierarchy of such strings through a complex analysis 
of the forward and reverse parser interface. Continuing the example 
above, for a given forward parser state there may be several contexts 
in which a left parenthesis may be taken such that each uniquely 
defines a reverse parser start state. 

Not locating such strings following an error would not necessarily 
constitute a second error and would require that hierarchical sets such 
as these also include any "primary" keys defined for the grammar, such 
as the period and semicolon previously discussed. If the forward 
parser was currently parsing an <If Statement>, for example, and 
locating the reserved word THEN would enable engaging the reverse parser, 
not locating that key should not automatically constitute a second error. 
That particular key may be involved directly in the detected error 
sequence and scanning should continue, searching for the next definable 
key in the key set for <If Statement>. 

IV. IMPLEMENTATION 


For the purpose of implementation of the error recovery system 
defined, considerations were restricted to those syntax errors involving 
only single symbols and transposition of symbol pairs. Extensions of 
the system to include errors of greater complexity and scope will be 
discussed at the conclusion. 


A. COMPILER 

A basic model of the proposed error correction system was implemented 
in an XPL compiler for ALGOL-E, a non-trivial ALGOL-like language (134 
productions, 50 terminal symbols, 74 non-terminal symbols). A listing 
of the grammar is provided in the Appendix. The model is semantics 
independent, its parameters being solely derived from the forward and 
reverse parsers, i.e., parse states and associated symbol sets. 

The compiler was constructed from an existing ALGOL-E compiler 
employing MSP parsing [6] and an XPL skeleton compiler written by 
DeRemer [1] for his SLR(k) parser. Figure 2 shows some of the detail 
in the construction of the hybrid model compiler. Studies have shown 
that the SLR(k) parser constructor and the resulting parser require 
significantly less space and time than the MSP parsers [2,4]. This was 
also found to be the case in this application. The SLR(k) parser for 
ALGOL-E required approximately 64 percent of the space required for the 
MSP parser for the same grammar. This was considered significant as 
the error correction technique to be implemented would require both a 
forward and a reverse parser. 

The SLR(k) parser constructor was defined and implemented by DeRemer. 
The gained efficiency of his system over other basic LR(k) parser 
constructors was achieved by constructing an LR(0) parser for the grammar 
then adding lookahead states only where they were needed. This approach 
resulted in faster construction and reduced parser size. 


B. GRAMMAR 

The ALGOL-E grammar [6] was found to be not SLR(1), as was also the 
case for the reverse grammar. The required changes to the grammar were 
essentially minor and did not detract from or enhance the language. It 
was necessary to change the delimiters in a read statement from 
parentheses to vertical bars, and the ambiguity of the ALGOL assignment 
symbol, :=, was resolved by defining a new terminal symbol, <Setq>. 
<Setq> is transparent to the programmer, as are <Identifier>, <Number>, 
and <String>, and is similarly assigned in procedure SCAN of the compiler. 
Additionally, procedure calls were differentiated from function calls 
by requiring the reserved word CALL to precede the name of the subroutine. 
It was also necessary to delimit <Declaration Set> with periods vice 
semicolons. 


C. SPELLING CORRECTIONS 

Empirically, misspelled identifiers and reserved words form a 
significant percentage of errors; therefore, after appropriate modification, 
a spelling checking system was incorporated into the compiler [11]. An 
attempted error correction would fail if the reverse parser failed to 
return to the point of the input stream at which the forward parser was 
halted; hence, it was necessary to also enable spelling correction of 
misspelled reserved words in the reverse parser. Only reserved words 
are pertinent to the reverse parser spelling checking procedure as it 
is concerned with only the syntax of identifiers, not the semantics, 
i.e., spelling. The spelling checking procedures incorporated were 
simplistic but demonstrative; only those errors involving one deleted 
or added character, one character in error, or two adjacent characters 
transposed were correctable. The complexity and sophistication 
are easily extended if one is willing to absorb the additional cost in 
terms of space and time. 
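
The four correctable classes of spelling error can be tested directly. The sketch below is in Python and is only an illustration of that test, not the checker cited as [11]; the function names and the short reserved-word list are assumptions made for the example.

```python
from typing import List, Optional


def one_error_apart(word: str, target: str) -> bool:
    """True if `word` differs from `target` by one deleted or added
    character, one character in error, or two adjacent characters
    transposed."""
    if word == target:
        return False
    if len(word) == len(target):
        diffs = [i for i in range(len(word)) if word[i] != target[i]]
        if len(diffs) == 1:
            return True  # one character in error
        return (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and word[diffs[0]] == target[diffs[1]]
                and word[diffs[1]] == target[diffs[0]])  # transposition
    if abs(len(word) - len(target)) == 1:
        longer, shorter = (word, target) if len(word) > len(target) else (target, word)
        # Removing one character from the longer word must give the shorter.
        return any(longer[:i] + longer[i + 1:] == shorter
                   for i in range(len(longer)))
    return False


def correct_reserved_word(word: str, reserved: List[str]) -> Optional[str]:
    """Return the single reserved word within one error of `word`, if any."""
    matches = [r for r in reserved if one_error_apart(word, r)]
    return matches[0] if len(matches) == 1 else None
```

A word such as PETS differs from STEP in all four positions, so a test of this kind does not treat it as a misspelling of STEP.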


D. PROCEDURE 

The model consists of two primary procedures, ERROR ANALYZER and 
REVERSE PARSER (reference Figure 3). CAN DO WITHOUT TOP, FP_INSTRSCT RP, 
and CHECK CONTEXT OF TOP AND TOKEN are called from ERROR ANALYZER to 
determine if a symbol is a member of an applicable symbol set or to 
determine the symbol in the intersection of the applicable symbol sets 
of the forward and reverse parsers. The applicable symbol 
sets are those read and/or lookahead symbol sets for a particular 
forward or reverse parse state. Procedures TRANSPOSE, REPLACE, DELETE, 
and INSERT are called when a tentative error solution has been 
determined and the action implied by the procedure names is to be applied to 
the symbol at the top of the stack and/or the token symbol (next symbol 
to be read). 

As in the case of spelling correction, the scope of errors was 
restricted to single symbol insertion or deletion, one symbol in error, 
or two adjacent symbols transposed. 

Error analysis was restricted to only that symbol on the top of the 
stack and/or the token symbol. This restriction was imposed to 
preclude having to delete code that may have been emitted with the 
possible reduction of the second symbol in the stack prior to detecting 
the error. Further, the heuristic choice was made to first test for 
the possibility of deleting the error symbol. This was to reduce the 
occurrences of having to define a <Number>, <Identifier>, or <String> 
should the case be that the error was caused by any one of those 
omissions. For example, if X:=Y++Z; was the input string then one of the 
operators would be deleted vice inserting either <Number> or <Identifier> 
or any other expression. 

For purposes of implementation, the period and semicolon were 
defined as the primary keys for all cases. EOF was designated the terminal 
key. The period was used as the primary key when the syntax analyzer 
was parsing declarations (reference ALGOL-E (Modified) grammar listing) 
and semicolon was the primary key elsewhere. 
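
The choice of primary key and the scan to it can be sketched as follows. This is a Python illustration only; `scan_to_key`, the `in_declarations` flag, and the spelling `<EOF>` for the terminal key are assumptions introduced for the example.

```python
from typing import List, Tuple

EOF = "<EOF>"  # hypothetical spelling of the terminal key


def scan_to_key(symbols: List[str], pos: int,
                in_declarations: bool) -> Tuple[List[str], str]:
    """Buffer symbols from `pos` up to and including the primary key.

    The period is the key while parsing declarations, the semicolon is
    the key elsewhere, and EOF ends the scan in either case.
    """
    primary = "." if in_declarations else ";"
    buffered: List[str] = []
    for symbol in symbols[pos:]:
        buffered.append(symbol)
        if symbol == primary or symbol == EOF:
            return buffered, symbol
    buffered.append(EOF)  # input exhausted without an explicit EOF symbol
    return buffered, EOF
```

The buffered list plays the role of the symbol stack that the reverse parser later processes in reverse.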

When the forward parser is stopped by an error condition it is in 
either a read or a lookahead state and either the two top symbols on 
the stack or the top symbol and the lookahead symbol will constitute 
an illegal symbol pair. At this point, the history of the finite state 
machine for the grammar is known or may be determined directly from the 
current parse state and the set of read or lookahead symbols associated 
with that state. That is, given a symbol from the current applicable 
set, either the symbol will be stacked, indicating that the right part 
of some production is one symbol more complete, or the symbol just 
looked at will specify that the right part of a production has been 
completely read and a corresponding reduction will be made in the stack. 
The result of that reduction will in turn specify another symbol (a 
production left-part) toward completing the right part of some 
production entered further down in the stack. 

If the error symbol cannot be corrected as a misspelling then the 
error analyzing mechanism is engaged. Symbols are read into a symbol 
stack while the input stream is scanned for a key. The reverse parser 
is initiated in the state specified for the key and, operating with its 
own state stack, processes the symbol stack in reverse until it is 
stopped by an error that it cannot resolve as a misspelling or it 
reaches the point in the code stream at which the forward parser 
stopped. For example, reference Figure 4. 

Figure 4(a) depicts the configuration of the forward parsing stacks 
when an error (e) has been detected. The symbol e represents an error 
sequence of length n or less. If NEXT SYMBOL(SP) is <Identifier> and 
is determined to be a misspelled reserved word then the correction is 
made immediately and parsing resumes; otherwise, the point of progress 
of the parse stack is marked (SAVE SP, Fig 4(b)). The input stream is 
read to the key and the reverse parser is started in the read state 
for that key (R STATE STACK(RP)). 

Figure 4(c) depicts the configuration of the stacks after the 
reverse parser has successfully parsed back to the error point and 
error analysis and correction begins. (Note: pointers SP and SAVE SP 
have been interchanged for compiler execution considerations only.) 
When forward parsing resumes after error correction, symbols through 
the key are read from stack NEXT SYMBOL. Only then does the parser 
return to reading the input stream. If the error cannot be resolved or 
the reverse parser is halted short of error e by additional errors then 
the code from the error to the key (NEXT SYMBOL(SAVE SP)) is deleted. 
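
The rule that buffered symbols are consumed before the input stream is read again can be illustrated with a small replay buffer. The Python sketch below is an assumption about one way to arrange this; the class `TokenSource`, its method names, and the `<EOF>` marker are hypothetical and do not describe the NEXT SYMBOL stack of the XPL model in detail.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


class TokenSource:
    """Supplies symbols to the forward parser, draining any symbols
    pushed back after error correction before reading fresh input."""

    def __init__(self, stream: Iterable[str]) -> None:
        self._stream: Iterator[str] = iter(stream)
        self._replay: Deque[str] = deque()

    def push_back(self, symbols: Iterable[str]) -> None:
        # Keep the pushed-back symbols in their original left-to-right order.
        self._replay.extendleft(reversed(list(symbols)))

    def next(self) -> str:
        if self._replay:
            return self._replay.popleft()
        return next(self._stream, "<EOF>")
```

After a correction, the corrected symbols through the key would be pushed back so that the forward parser re-reads them first.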

Figure 5 depicts various configurations the two parsers may be in 
when the reverse parser has stopped. In conditions 5(i) and 5(j) the 
errors are defined to be too far apart and the symbols from error e to 
the key are deleted and forward processing is re-initialized at the 
semicolon. 

The conditions depicted in Figures 5(a) through 5(h) fall within the 
scope-of-error restrictions imposed and error analysis may be performed. 
Note that in configurations 5(a), 5(b), 5(e), and 5(f) the reverse 
parser may or may not be in an error state, i.e., symbol e_1 may be 
syntactically correct as the left context of the symbol a that follows it. 

For configurations 5(a) through 5(h), the e and a symbols shown in 
Figure 5 are checked against the read or lookahead symbol sets for the 
forward and reverse parser states so as to make an appropriate deletion, 
insertion, or transposition. If the error cannot be so resolved then a 
symbol is heuristically selected from the applicable forward parser 
symbol set without reference to the reverse parser and inserted in front 
of the error symbol. This heuristic approach may be applied four times 
before code will be deleted. Control is then returned to the forward 
parser. 

Figure 5 

Example 1: Configuration 5(a) 

Both the forward and reverse parsers are in read states after 
reading symbol e_1. Let the forward parser be in state f_j and the 
reverse parser be in state r_k. Let fss_j be the set of symbols associated 
with the forward parser read state f_j and, similarly, let rss_k represent 
the symbol set for r_k. 

If the a symbol that follows e_1 for the forward parser is a member of 
fss_j and the a symbol that follows e_1 for the reverse parser is a 
member of rss_k, then delete e_1 and continue normal processing. 

If the reverse parser (RP) is not in an error condition then step 
RP to its next read or lookahead state (r_(k+1)). If the intersection of 
fss_j and rss_(k+1) is empty then replace e_1 with the intersection of 
fss_j and rss_k (if this intersection is also empty then replace e_1 
with a symbol from fss_j) and continue processing; otherwise, insert the 
intersection of fss_j and rss_(k+1) in front of e_1 and continue. (Note: 
That the reverse parser may not be in an error condition when it reads 
the symbol causing the error for the forward parser is very pertinent 
to the error analysis process. If it is the case that it is not in 
error then the initial assumption is that a symbol is missing in front 
of the error symbol. With that assumption made, a symbol that is 
syntactically correct for both parsers is required for insertion in 
front of the error symbol. This is accomplished by stepping the reverse 
parser to its next read or lookahead state, whichever occurs first. 
The insertion symbol is then taken from the intersection of the symbol 
sets associated with the two parse states.) 

Otherwise (RP is in an error condition), replace e_1 with the 
intersection of the FP and RP read state symbol sets if that intersection 
is not empty (if that intersection is empty then insert a symbol from 
fss_j) and continue processing. 
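
The decision sequence of Example 1 can be written out as a single function. The Python sketch below is one reading of the steps, offered as an illustration rather than as the ERROR ANALYZER procedure itself: the function and parameter names are hypothetical, `rss_next` stands for the symbol set of the reverse parser after it is stepped, and `pick` simply takes the smallest symbol so that the result is repeatable.

```python
from typing import Optional, Set, Tuple

Action = Tuple[str, Optional[str]]  # (operation, symbol involved if any)


def pick(symbols: Set[str]) -> Optional[str]:
    """Choose one symbol from a set in a repeatable way (an assumption)."""
    return min(symbols) if symbols else None


def correct_single_symbol(after_fwd: str, after_rev: str,
                          fss: Set[str], rss: Set[str],
                          rss_next: Set[str], rp_in_error: bool) -> Action:
    # Deletion of the error symbol is tested first.
    if after_fwd in fss and after_rev in rss:
        return ("delete", None)
    if not rp_in_error:
        # Assume a symbol is missing in front of the error symbol.
        insertion = fss & rss_next
        if insertion:
            return ("insert", pick(insertion))
        replacement = fss & rss
        if replacement:
            return ("replace", pick(replacement))
        return ("replace", pick(fss))  # heuristic fallback
    replacement = fss & rss
    if replacement:
        return ("replace", pick(replacement))
    return ("insert", pick(fss))  # heuristic fallback
```

The two fallback branches are the heuristic choices; everything above them is determined by the symbol sets alone.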

Example 2: Configuration 5(d) 

Both of the parsers are in error conditions; the forward parser (FP) 
is in read state f_j and RP is in lookahead state r_k. Again, let 
fss_j and rss_k be the symbol sets for the respective parse states. 

If e_2 is a member of fss_j and e_1 is a member of rss_k then 
transpose e_1 and e_2 and continue processing. 

If the intersection of the two symbol sets is not empty then 
replace e_1 with a symbol from fss_j, delete e_2, and continue. 

If e_2 is a member of fss_j then delete e_1 and continue. 

Otherwise (attempt the last resort), replace e_1 with a symbol from 
fss_j and continue processing. 
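
The transposition test that opens Example 2 is the most mechanical of these steps and can be sketched on its own. In the Python below, the function names and the sample symbol sets are assumptions: each parser is taken to be stopped on the first error symbol it meets, and the pair is judged transposed when each parser would accept the other symbol of the pair.

```python
from typing import List, Set


def is_transposition(e1: str, e2: str, fss: Set[str], rss: Set[str]) -> bool:
    """e1 precedes e2 in the input. The forward parser must accept e2
    where it met e1, and the reverse parser must accept e1 where it met e2."""
    return e2 in fss and e1 in rss


def transpose(symbols: List[str], i: int) -> List[str]:
    """Return a copy of `symbols` with positions i and i+1 exchanged."""
    fixed = list(symbols)
    fixed[i], fixed[i + 1] = fixed[i + 1], fixed[i]
    return fixed


# Hypothetical case: an operand and an operator have changed places.
symbols = ["X", ":=", "Y", "Z", "+", ";"]
fss = {"+", "-", ";"}          # forward parser after reading Y
rss = {"Z", "<Number>", ")"}   # reverse parser after reading ;
```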


E. RESULTS 

Figure 6 gives examples of some of the results of the error correcting 
system described. Generally, the system recognized a broad class of single 
symbol errors of insertion and omission and double symbol transposition 
errors. However, there was one small, well-defined class of error that, 
though recognized, could not be corrected while retaining the imposed 
restriction of not modifying the parse stack below the top symbol to 
achieve an error solution. 

For those constructs in which a statement was started with a 
reserved word followed by an identifier, the omission of the reserved 
word was not detected until the symbol following the identifier was read 
as it is syntactically correct for statements to also start with an 
identifier. In this instance, the true error point was two symbols from 
the top of the parse stack when the error condition was recognized. 

In the case where the reserved word was not omitted but merely 
grossly misspelled such that the symbol was interpreted as an 
identifier, the error condition arose when the following identifier was 
read. In this instance the true error point was one down from the top 
of the stack. 

For both situations, the omission and misspelling of the reserved 
word, by the time the error was discovered, the identifier following 
the error had already been reduced and associated code emitted. 

For the class of error conditions that was processed correctly, 
most conditions were corrected in a logical manner; logical in the sense 
that the corrections made were those that a human reader would be 
expected to make. A few configurations were made syntactically correct 
but not in the logical sense defined above. 

Example: FOR A := PETS 1 1 UNTIL... 

PETS, not recognized as a misspelling of STEP, was interpreted as 
an identifier resulting in the first "1" being replaced by STEP. 

For the case of self-embedding symbol pairs such as BEGIN...END and 
(...), the omission or duplication of the leading or left symbol 
resulted in the deletion or insertion of the right symbol at a later 
point in the input stream. At first brush, this particular correction 
may seem fairly gross but the deletion/insertion points were syntactically 
defined without regard for whatever the programmer's intended logic may 
have been. 

For those errors that the system could not correct, the history of 
the attempts at solution prior to abandoning the error and deleting code 
and a definition of the last error encountered by the reverse parser 
were made available to the programmer, thereby fairly isolating the 
error and defining the inability to make a correction. 

The time involved in correcting errors averaged about 0.015 seconds 
per error for the programs tested. 

V. CONCLUSIONS 


The syntax error correcting procedure proposed in this thesis is a 
viable system. While costs in terms of time and space are involved, 
its effects on a user's code are considerably more attractive than those 
of popular recovery systems employing automatic deletion of code to some 
stop symbol. Whereas the proposed system was defined to be grammar 
independent, the working model implemented was semi-automatic, using 
predefined start states for the reverse and forward parsers. It is 
recognized that these crossover points are significant with respect to 
fully automating the error correction process; however, they are the 
only points in the model that are language dependent. The correction 
procedures themselves are language independent; their only parameters 
are parse states and associated symbol sets defined by the parser 
constructor. 

The power in the procedure is attributable to the LR(k) parsing 
employed. Errors are examined in a very large context provided by the 
two disjoint state stacks of the forward and reverse parsers. Through 
LR(k) parsing, syntax errors are detected as the input stream is read 
and are precluded from the symbol stack. 

The model demonstrates that the proposed system detects and 
deterministically corrects a large class of errors thereby affording the 
programmer maximum exposure of his code to the analytical processes. 

A strong heuristic attempt to correct is provided for those cases that 
the error cannot be resolved deterministically. Should error correction 
fail entirely, the system provides a good diagnosis and all residue of 
the error is removed, thereby insuring against generating or cascading 
syntax errors through the remainder of the input stream. 

VI. EXTENSIONS 


The error correction system described in this thesis indicates 
several areas where worthwhile extensions can be made and where further 
analysis is necessary. 


A. KEY DEFINITION 

As keys seem to lend themselves to empirical definition, it 
would seem logical that they may be analytically defined as the grammar 
to which they belong is being analyzed. An analyzer capable of defining 
a set of valid keys should also enable automating the error corrector 
by associating keys with states for both parsers and providing an 
automatic link to a key and the engagement of the error analyzing system 
from any state in the parser when a syntax error is detected. It may be 
feasible and practical to define a hierarchy of keys so that it would 
not be required to go beyond a minimum distance past the outer limit 
of the allowable error sequence. This would serve to minimize the 
likelihood of encountering another error thereby causing the corrector 
to abort. 

It may also be of value to define a grammar analyzer capable of 
recognizing hierarchies of key symbols and symbol strings and 
associating these sets with unique parser states such that, for a given 
sentential form, dedicated keys are available to minimize the key-to-error 
distance and increase the probability that a key itself does not 
constitute an error. 

B. ERROR EXTENSION 

The current implementation severely restricts errors to single 
symbols except in the case of adjacent symbol transposition. A logical 
extension would be to extend the limits to provide for multiple symbol 
errors. This would require either predefining and storing the legal 
symbol strings or defining a symbol string generator to be called as 
required. 


C. CLASSIC LR(k) VERSUS SLR(k) 

The classic LR(k) parser stops whenever it encounters an error 
symbol in either a read or lookahead state. The parser employed in the 
model defaults to the next read state in the event that the lookahead 
symbol is not a member of the symbol set associated with a particular 
lookahead state. That is, a successful lookahead defines a stack 
reduction; otherwise the decision is to stack (read) the lookahead 
symbol via the next logical read state. Only after the symbol is read 
is it determined whether it is an error symbol or not. It would be 
advantageous to be able to stop the parser in a lookahead state rather 
than in the next read state so as to keep the symbol preceding the 
error readily accessible at the top of the stack and available to 
participate in error analysis. 
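
The difference between the two behaviours in a lookahead state can be shown side by side. The Python sketch below is a simplified illustration under assumptions: each lookahead state is reduced to a set of symbols that call for a reduction and a set that call for a read, and the function names and example sets are hypothetical.

```python
from typing import Set


def classic_lookahead(symbol: str, reduce_set: Set[str],
                      read_set: Set[str]) -> str:
    """Classic behaviour: an unexpected symbol is an error at once."""
    if symbol in reduce_set:
        return "reduce"
    if symbol in read_set:
        return "read"
    return "error"


def defaulting_lookahead(symbol: str, reduce_set: Set[str]) -> str:
    """Behaviour of the model's parser: anything that does not call for
    a reduction is passed to the next read state, where an error symbol
    is only discovered after the attempt to read it."""
    return "reduce" if symbol in reduce_set else "read"


reduce_set = {";", ")"}
read_set = {"+", "*"}
```

For the symbol IF, the classic form reports an error while still in the lookahead state, whereas the defaulting form moves on to a read state before the error is seen.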


D. STACK ACCESSIBILITY 

As inconvenient as it may be, there are constructs in the grammar 
such that their containing errors is undetectable until the point where 
correction is needed is in the stack. More analysis is needed to weigh 
the costs of incorporating a means of accessing the stack and, if 
necessary, deleting and regenerating code against the desire to and 
benefits of being able to correct this type of error. 